Abstract. Exchangeable random variables form an important and well-studied generalization of i.i.d. variables; however, simple examples show that no nontrivial concept or function classes are PAC learnable under general exchangeable data inputs $X_1, X_2, \ldots$. Inspired by the work of Berti and Rigo on a Glivenko-Cantelli theorem for exchangeable inputs, we propose a new paradigm, adequate for learning from exchangeable data: predictive PAC learnability. A learning rule $\mathcal{L}$ for a function class $\mathscr{F}$ is predictive PAC if for every $\epsilon, \delta > 0$ and each function $f \in \mathscr{F}$, whenever $|\sigma| > s(\delta,\epsilon)$, we have with confidence $1 - \delta$ that the expected difference between $f(X_{n+1})$ and the image of $f \upharpoonright \sigma$ under $\mathcal{L}$ does not exceed $\epsilon$ conditionally on $X_1, X_2, \ldots, X_n$. Thus, instead of learning the function $f$ as such, we are learning to a given accuracy $\epsilon$ the predictive behaviour of $f$ at the future points $X_i(\omega)$, $i > n$, of the sample path. Using de Finetti's theorem, we show that if a universally separable function class $\mathscr{F}$ is distribution-free PAC learnable under i.i.d. inputs, then it is distribution-free predictive PAC learnable under exchangeable inputs, with a slightly worse sample complexity.

Index Terms: Exchangeable random variables, de Finetti theorem, predictive PAC learnability.

I. Introduction 

In the classical theory of statistical learning as initiated in [13], [?] (see [14] for a historical and philosophical perspective), data inputs are traditionally modelled by a sequence of i.i.d. random variables $(X_i)$. Generalizing this approach usually involves easing the i.i.d. restriction on the sequence of inputs, all the while trying to obtain the same conclusions as in the classical theory, namely the uniform convergence of empirical means and subsequently the PAC learnability of a concept or a function class under the usual combinatorial restrictions in terms of shattering. For instance, the i.i.d. condition can be relaxed to that of being an ergodic stationary sequence ([12], p. 9), or a $\beta$-mixing sequence [16]. As to $\alpha$-mixing sequences, they are known to result in the same PAC learnable function classes under a single distribution [17], although it is still unknown whether uniform convergence of empirical means takes place [18]. An interesting recent investigation is [11].

However, at some point this approach hits a wall. Among the best studied classes of dependent stationary random variables are exchangeable random variables [?]; [?], p. 473; [9], [10]. A sequence of random variables $(X_i)$ is exchangeable if for every finite sequence $(i_1, i_2, \ldots, i_n)$ of distinct positive integers the joint distributions of $(X_{i_1}, X_{i_2}, \ldots, X_{i_n})$ and of $(X_1, X_2, \ldots, X_n)$ are the same.



According to the famous de Finetti theorem [6], [7], a sequence $(X_i)$ is exchangeable if and only if the joint distribution $P$ on $\Omega^\infty$ is a mixture of product distributions (that is, $(X_i)$ is a mixture of a family of i.i.d. random sequences).

A nice illustration and the most extreme example of an exchangeable sequence which is not i.i.d. is a sequence of identical copies of one and the same random variable, $X_i = X$, $i = 1, 2, \ldots$. The joint distribution of this process is a measure supported on the diagonal of the infinite product space $\Omega^\infty$, which is clearly a mixture of infinite powers of all Dirac point masses on $\Omega$.

Now, it is immediately clear that no nontrivial function class $\mathscr{F}$ on a domain $\Omega$ will be PAC learnable under such a data input process: almost every sample path will be constant, of the form $(x, x, x, \ldots)$, thus revealing no information about the values of a function $f \in \mathscr{F}$ away from $x$. Consequently, if we want to be able to learn from exchangeable data inputs, the paradigm of learnability itself has to be re-examined.
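
The obstruction can be seen in a small simulation. The following is only an illustrative sketch and not part of the original argument: the helper names `sample_diagonal`, `sample_iid` and `distinct_points`, the uniform law on [0, 1], and the path length are all assumptions made for the example. A path of the diagonal process reveals the value of a function at a single point, while an i.i.d. path of the same length reveals it at many points.

```python
import random

def sample_diagonal(n, rng):
    """Sample path of the process X_i = X: one draw of X, repeated n times."""
    x = rng.random()  # assumption: X is uniform on [0, 1], chosen for illustration
    return [x] * n

def sample_iid(n, rng):
    """Sample path of an i.i.d. sequence with the same one-dimensional law."""
    return [rng.random() for _ in range(n)]

def distinct_points(path):
    """Number of distinct domain points at which a function's value is revealed."""
    return len(set(path))

rng = random.Random(0)
diag = sample_diagonal(1000, rng)
iid = sample_iid(1000, rng)
print(distinct_points(diag))  # the diagonal path contains a single point
print(distinct_points(iid))   # the i.i.d. path contains many distinct points
```

Both processes are exchangeable and have the same marginal law, so the difference lies entirely in the directing measure: a point mass on the uniform law in the i.i.d. case, and a spread over Dirac masses in the diagonal case.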

A way out was shown by Berti and Rigo in their visionary note [2], where they prove that the classical Glivenko-Cantelli theorem holds for a sequence $(X_i)$ of exchangeable random variables if and only if the sequence is i.i.d. At the same time, they observe that the classical GC theorem is formally equivalent to the statement about the predictive distribution being approximated by the observed frequency:

$$\sup_t \left| F_n(t,\omega) - P(X_{n+1} \leq t \,\|\, X_1, \ldots, X_n)(\omega) \right| \to 0 \quad \text{a.s.}$$

Here $F_n(t,\omega) = (1/n) \sum_{i=1}^{n} I_{(-\infty,t]}(X_i)$ is the empirical mean of the indicator function, and $P(\cdot \,\|\, X_1, \ldots, X_n)$ is the conditional probability. As shown in [2], in this form the statement remains valid if the random variables $(X_i)$ are exchangeable, and the result can be considered as a conditional (or: predictive) version of the classical Glivenko-Cantelli theorem.
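
For a concrete feel of the predictive statement, consider a binary exchangeable sequence directed by a Beta distribution. This model is an assumption introduced here for illustration and does not appear in [2]; the names `sample_beta_bernoulli`, `predictive_prob`, `empirical_freq` and `gap` are hypothetical. By the standard Beta-Bernoulli conjugacy, the predictive probability of a 1 is $(a + S_n)/(a + b + n)$, where $S_n$ is the number of ones observed, so the distance to the observed frequency is at most $\max(a,b)/(a+b+n)$ and tends to zero. For 0/1-valued variables this distance equals the supremum over $t$ in the displayed formula.

```python
import random

def sample_beta_bernoulli(n, a, b, rng):
    """Exchangeable 0/1 path: draw theta ~ Beta(a, b), then n i.i.d. Bernoulli(theta)."""
    theta = rng.betavariate(a, b)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

def predictive_prob(xs, a, b):
    """P(X_{n+1} = 1 | X_1, ..., X_n) under the assumed Beta(a, b) mixture."""
    return (a + sum(xs)) / (a + b + len(xs))

def empirical_freq(xs):
    """Observed frequency of ones among X_1, ..., X_n."""
    return sum(xs) / len(xs)

def gap(xs, a, b):
    """Distance between the predictive probability and the observed frequency."""
    return abs(predictive_prob(xs, a, b) - empirical_freq(xs))

rng = random.Random(0)
a, b = 2.0, 3.0
for n in (10, 100, 1000):
    xs = sample_beta_bernoulli(n, a, b, rng)
    # The gap never exceeds max(a, b) / (a + b + n), whatever the path.
    print(n, gap(xs, a, b), max(a, b) / (a + b + n))
```

The observed frequency itself converges to the randomly drawn parameter rather than to a fixed number, which is why the unconditional form of the theorem fails while the predictive form survives.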

Since the uniform Glivenko-Cantelli theorems are at the 
heart of statistical learning, one would think that the approach 
of Berti and Rigo should have consequences for learning from 
exchangeable inputs. We show that this is indeed the case: by 
replacing PAC learnability with predictive PAC learnability, 
one arrives at a new broad paradigm of learnability suited for 
learning under exchangeable inputs. 

Say that a function class $\mathscr{F}$ is predictively PAC learnable under a given class $\mathcal{P}$ of exchangeable random processes $(X_n)$ if there exists a predictive PAC learning rule for $\mathscr{F}$ under $\mathcal{P}$, that is, a map $\mathcal{L}$ from the sample space $S$ to a hypothesis class such that

$$P\{\sigma : E(|(\mathcal{L}(f \upharpoonright \sigma) - f)(X_{n+1})| \,\|\, X_1, X_2, \ldots, X_n) > \epsilon\} \to 0$$

uniformly in $f \in \mathscr{F}$ and $(X_n) \in \mathcal{P}$. This is different from PAC learnability in that the expected value of $|\mathcal{L}(f \upharpoonright \sigma) - f|$ is replaced with the conditional expectation given $X_1, X_2, \ldots, X_n$. If in particular $(X_i)$ are i.i.d., the above definition is a reformulation of PAC learnability under the family of corresponding laws on the domain $\Omega$.
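
The reduction to ordinary PAC learnability in the i.i.d. case can be sketched in one line. The derivation below is offered only as a reading aid: it assumes that $(\sigma, x) \mapsto |\mathcal{L}(f \upharpoonright \sigma)(x) - f(x)|$ is jointly measurable, and the symbols $\mu$ (the common law of the $X_i$) and $h_\sigma$ (the hypothesis produced from the sample) are notation introduced here.

```latex
% Assumption: X_1, X_2, ... are i.i.d. with common law \mu, so X_{n+1} is
% independent of \sigma = (X_1, ..., X_n), while the hypothesis
% h_\sigma = \mathcal{L}(f \upharpoonright \sigma) depends on \sigma only.
\begin{align*}
E\bigl( |h_\sigma - f|(X_{n+1}) \,\big\|\, X_1, \ldots, X_n \bigr)
  &= \int_\Omega |h_\sigma(x) - f(x)| \, d\mu(x) \\
  &= E_\mu |\mathcal{L}(f \upharpoonright \sigma) - f| \quad \text{a.s.}
\end{align*}
```

Under these assumptions the event in the predictive definition coincides almost surely with the event that $E_\mu|\mathcal{L}(f \upharpoonright \sigma) - f|$ exceeds $\epsilon$, which is the event controlled in the classical PAC definition.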

We show that if a function class $\mathscr{F}$ is distribution-free PAC learnable under the usual assumption that the data sample inputs are i.i.d., then $\mathscr{F}$ is predictively PAC learnable under the class of all sequences of exchangeable data inputs. Our results are obtained under the assumption that $\mathscr{F}$ is universally separable.

II. Setting for learnability 

Here we review the PAC learnability model [?], [?], [13], [16] in order to fix a precise setting. The domain, or instance space, $\Omega = (\Omega, \mathscr{A})$ is a measurable space, that is, a set $\Omega$ equipped with a sigma-algebra of subsets $\mathscr{A}$. We will assume that $\Omega$ is a standard Borel space, that is, a complete separable metric space equipped with the sigma-algebra of Borel subsets. For instance, without loss of generality one can always assume that $\Omega = \mathbb{R}^k$ is the Euclidean space.

Denote by $\mathscr{B}(\Omega, [0,1])$ the collection of all Borel measurable functions from $\Omega$ to $[0,1]$. A function class $\mathscr{F}$ is a subfamily of $\mathscr{B}(\Omega, [0,1])$.

The family $P(\Omega)$ of all probability measures on $(\Omega, \mathscr{A})$ is itself a measurable space, whose sigma-algebra is generated by the functions $\nu \mapsto \nu(A)$ from $P(\Omega)$ to $\mathbb{R}$, as $A$ runs over $\mathscr{A}$.

In the PAC learning model, a set $\mathcal{P}$ of probability measures on $\Omega$ is fixed. Usually either $\mathcal{P} = P(\Omega)$ is the set of all probability measures (distribution-free learning), or $\mathcal{P} = \{\mu\}$ is a single measure (learning under a fixed distribution).

A learning sample is a pair $s = (\sigma, r)$ consisting of a finite subset $\sigma$ of $\Omega$ and of a function $r$ on $\sigma$. It is convenient to assume that elements $x_1, x_2, \ldots, x_n \in \sigma$ are ordered, and thus the set of all samples $(\sigma, r)$ with $|\sigma| = n$ can be identified with $(\Omega \times [0,1])^n$. For $\sigma \in \Omega^n$ and a function $f \in \mathscr{F}$ we will denote by $f \upharpoonright \sigma$ the sample obtained by restricting $f$ to $\sigma$.

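A learning rule is a mapping

$$\mathcal{L} \colon \bigcup_{n=1}^{\infty} \Omega^n \times [0,1]^n \to \mathscr{B}(\Omega, [0,1]),$$

which is measurable with regard to every Borel structure induced on $\mathscr{B}(\Omega, [0,1])$ by the distances $L^1(\mu)$, $\mu \in \mathcal{P}$.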

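A learning rule $\mathcal{L}$ is consistent if for every $f \in \mathscr{F}$ and each $\sigma \in \Omega^n$ one has

$$\mathcal{L}(f \upharpoonright \sigma) \upharpoonright \sigma = f \upharpoonright \sigma.$$

Consistent learning rules exist for every function class $\mathscr{F}$ under mild measurability restrictions.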
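
As a toy instance of consistency, consider threshold concepts on the real line. The sketch below is purely illustrative: the concept class, the helper names `threshold`, `restrict` and `consistent_rule`, and the sample points are assumptions made for the example and are not taken from the text.

```python
import math

def threshold(t):
    """Concept f_t(x) = 1 if x >= t, else 0 (a hypothetical example class)."""
    return lambda x: 1 if x >= t else 0

def restrict(f, sigma):
    """The labelled sample f|sigma: the points of sigma paired with the values of f."""
    return [(x, f(x)) for x in sigma]

def consistent_rule(sample):
    """Return a threshold concept that agrees with every labelled point of the sample."""
    positives = [x for x, y in sample if y == 1]
    # With no positive point, any threshold above the sample works; use +infinity.
    t_hat = min(positives) if positives else math.inf
    return threshold(t_hat)

sigma = [0.9, 0.1, 0.4, 0.7]
f = threshold(0.5)
h = consistent_rule(restrict(f, sigma))
print(restrict(h, sigma) == restrict(f, sigma))  # True: h agrees with f on sigma
```

The hypothesis is generally a different function from the target (here the learned threshold is 0.7 rather than 0.5); consistency only requires agreement on the sample itself.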



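A learning rule $\mathcal{L}$ is probably approximately correct (PAC) for the function class $\mathscr{F}$ under the class of measures $\mathcal{P}$ if for every $\epsilon > 0$

$$\sup_{\mu \in \mathcal{P}} \sup_{f \in \mathscr{F}} P\{\sigma \in \Omega^n : E_\mu|\mathcal{L}(f \upharpoonright \sigma) - f| > \epsilon\} \to 0$$

as $n \to \infty$. Here $P$ stands for the $n$-fold product measure $\mu^{\otimes n}$ on $\Omega^n$.

Equivalently, there is a function $s(\delta,\epsilon)$ (sample complexity of $\mathcal{L}$) such that for each $f \in \mathscr{F}$ and every $\mu \in \mathcal{P}$ an i.i.d. sample $\sigma$ with more than $s(\delta,\epsilon)$ points has the property $E_\mu|f - \mathcal{L}(f \upharpoonright \sigma)| < \epsilon$ with confidence $> 1 - \delta$.

A function class $\mathscr{F}$ is PAC learnable under $\mathcal{P}$ if there exists a PAC learning rule for $\mathscr{F}$ under $\mathcal{P}$.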

If $\mathcal{P} = P(\Omega)$ is the set of all probability measures, then $\mathscr{F}$ is said to be (distribution-free) PAC learnable. At the same time, learnability under intermediate families of measures on $\Omega$ has received considerable attention, cf. Chapter 7 in [16].

A closely related concept to that of a PAC learnable class is that of a uniform Glivenko-Cantelli function class, that is, a function class $\mathscr{F}$ such that for each $\delta, \epsilon > 0$ one has, whenever $n > s(\delta,\epsilon)$,

$$\sup_{\mu \in P(\Omega)} P\left\{ \sup_{f \in \mathscr{F}} \left| E_\mu(f) - \frac{1}{n} \sum_{i=1}^{n} f(X_i) \right| > \epsilon \right\} < \delta.$$

One also says that $\mathscr{F}$ has the property of uniform convergence of empirical means (UCEM property). Here $s(\delta,\epsilon)$ is the sample complexity of the uniform Glivenko-Cantelli class (which in general has to be distinguished from the sample complexity of a learning rule).

Every uniform Glivenko-Cantelli function class is PAC learnable; for instance, every consistent learning rule for $\mathscr{F}$ is PAC, with the same learning sample complexity. For concept classes, the converse is also true, though not for function classes in general.

A function class $\mathscr{F}$ is universally separable [12] if it contains a countable subfamily $\mathscr{F}'$ with the property that every $f \in \mathscr{F}$ is a pointwise limit of a sequence $(f_n)$ of functions from $\mathscr{F}'$: for each $x \in \Omega$, one has $f_n(x) \to f(x)$ as $n \to \infty$.

Notice that in this paper, we only talk of potential learn- 
ability, adopting a purely information-theoretic viewpoint. 

III. Exchangeable variables and de Finetti's theorem

De Finetti's theorem, in its classical form ([?], Ch. IV; [?], Th. 7.2), states that a sequence $(X_i)$ of random variables taking values in a standard Borel space $\Omega$ is exchangeable if and only if the joint distribution $P$ of the sequence is a mixture of i.i.d. distributions. More precisely, there exists a probability measure $\eta$ on the Borel space $P(\Omega)$ of probability measures on $\Omega$ (the directing measure) so that

$$P = \int_{P(\Omega)} \theta^\infty \, \eta(d\theta), \tag{1}$$

in the sense that for every measurable function $f$ on $\Omega^\infty$ one has

$$E(f) = \int E_{\theta^\infty}(f) \, \eta(d\theta).$$

In this spirit, $\theta$ will denote a (random) element of $P(\Omega)$, and "almost all $\theta$" is to be understood in the sense of the directing measure $\eta$.

A slightly different viewpoint, adopted in [9], is to fix a random measure $\nu$, that is, a measurable mapping from the basic probability space to $P(\Omega)$. Under this approach, de Finetti's theorem can be put in the following, essentially equivalent, form. Denote by $\mathscr{T}$ the tail sigma-field on $\Omega^\infty$. Then, conditionally on $\mathscr{T}$, the sequence $(X_i)$ is i.i.d.:

$$P(\omega \in \cdot \,\|\, \mathscr{T}) = \nu^\infty \quad \text{a.s.}$$

Note that if $\theta \neq \zeta$, then $\theta^\infty$ and $\zeta^\infty$ are mutually singular. This follows from a remark of Kakutani [8], p. 223: fix $f$ with $E_\theta(f) \neq E_\zeta(f)$, then the empirical mean

$$\frac{1}{n} S_n(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_i)$$

converges at the same time $\theta^\infty$-a.s. to $E_\theta(f)$ and $\zeta^\infty$-a.s. to $E_\zeta(f)$. This observation helps to understand the decomposition (1).

The strong law of large numbers for exchangeable variables (cf. e.g. [10], Eq. (2.2) on p. 185, also [9], Proposition 1.4(f)) says that

$$\frac{1}{n} S_n(f) \to E(f(X_1) \,\|\, \mathscr{T}) \tag{2}$$

almost surely. If $P(A) = 1$, then a.s. $\nu^\infty(A) = 1$, that is, for almost all $\theta$, one has $\theta^\infty(A) = 1$. Thus, the convergence in (2) takes place $\theta^\infty$-a.s. for almost all $\theta \in P(\Omega)$. One concludes:

$$\text{For a.e. } \theta, \quad E(f(X_1) \,\|\, \mathscr{T}) = E_\theta(f) \quad \theta^\infty\text{-a.s.} \tag{3}$$

Informally, the conditional expectation $E(f(X_1) \,\|\, \mathscr{T})$ given the tail sigma-field is viewed by almost every non-random measure $\theta^\infty$ as a constant function, identically assuming the value $E_\theta(f)$.
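
This behaviour is easy to observe numerically. The sketch below is an illustration under assumptions of our own choosing: a directing measure concentrated on two Bernoulli laws with parameters 0.2 and 0.8, the function $f(x) = x$, and hypothetical helper names `sample_mixture` and `empirical_mean`. Along a single sample path the empirical mean settles near the parameter $\theta$ that was drawn, not near the overall mean under the mixture.

```python
import random

def sample_mixture(n, thetas, weights, rng):
    """Draw theta from the directing measure, then n i.i.d. Bernoulli(theta) values."""
    theta = rng.choices(thetas, weights=weights, k=1)[0]
    path = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, path

def empirical_mean(path):
    """(1/n) S_n(f) for f(x) = x."""
    return sum(path) / len(path)

rng = random.Random(0)
theta, path = sample_mixture(20000, [0.2, 0.8], [0.5, 0.5], rng)
overall_mean = 0.5 * 0.2 + 0.5 * 0.8  # E f(X_1) under the mixture P
# The empirical mean tracks the drawn theta, i.e. the tail-measurable limit,
# and stays away from the unconditional mean.
print(theta, empirical_mean(path), overall_mean)
```

In the language of (3), the limit is the random variable that equals $E_\theta(f)$ on the part of the path space charged by $\theta^\infty$.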

Lemma 3.1: Let $X_1, X_2, \ldots$ be a sequence of exchangeable random variables taking values in a standard Borel space $\Omega$. Then for every measurable function $f$ on $\Omega$, for all $i$ and all $j > n$:

$$E(E(f(X_i) \,\|\, \mathscr{T}) \,\|\, X_1, \ldots, X_n) = E(f(X_j) \,\|\, X_1, \ldots, X_n)$$

a.s., where $\mathscr{T}$ is the tail sigma-field. Consequently, if $\mathscr{G}$ is a countable family of measurable functions, then one has

$$\forall f \in \mathscr{G}, \quad E(E(f(X_i) \,\|\, \mathscr{T}) \,\|\, X_1, \ldots, X_n) = E(f(X_j) \,\|\, X_1, \ldots, X_n)$$

almost surely.

Proof: Because of exchangeability, one can assume without loss of generality that $i = 1$ and $j = n + 1$. Now it is enough to establish the result for indicator functions $f = I_A$ of some generating family of Borel subsets $A \subseteq \Omega$, for instance, by identifying $\Omega$ with $\mathbb{R}$ and considering the intervals $A = (-\infty, t]$. In this form, the result has been proved in Berti and Rigo [2], where a stronger assertion appears as formula (7) on p. 389. (Their function $F(t,\omega)$ is equal a.s. to $E(I_{(-\infty,t]}(X_1) \,\|\, \mathscr{T}) = P(X_1 \leq t \,\|\, \mathscr{T})$, which fact follows from the definition of $F(t,\omega)$ on p. 386, line $-9$, as the a.s. limit of $(1/n) S_n(I_{(-\infty,t]})$ and the strong law of large numbers (2).) The second claim is immediate. ■

IV. Predictive PAC learnability 

Definition 4.1: Let $X_1, X_2, \ldots$ be an exchangeable sequence of random variables with values in a standard Borel space $\Omega$. Denote by $P$ the joint distribution on $\Omega^\infty$. We say that a learning rule $\mathcal{L}$ for a function class $\mathscr{F}$ on $\Omega$ is predictively PAC with sample complexity $s(\delta,\epsilon)$ (under the sequence $(X_i)$), if for every $f \in \mathscr{F}$ and each $\epsilon, \delta > 0$, whenever $n > s(\delta,\epsilon)$, one has

$$P\{\sigma : E(|(\mathcal{L}(f \upharpoonright \sigma) - f)(X_{n+1})| \,\|\, X_1, X_2, \ldots, X_n) > \epsilon\} < \delta. \tag{4}$$

If $\mathcal{P}$ is a family of sequences of exchangeable random variables, then we say that a function class $\mathscr{F}$ is predictively PAC learnable under $\mathcal{P}$ if it admits a learning rule $\mathcal{L}$ that is predictively PAC under every exchangeable sequence $(X_i) \in \mathcal{P}$, with the sample complexity uniformly bounded by some function $s(\delta,\epsilon)$. Finally, if $\mathscr{F}$ is predictively PAC learnable under the family of all exchangeable sequences $(X_i)$, we will simply say that $\mathscr{F}$ is predictively PAC learnable.

The following theorem is the main result of the article. It allows one to deduce predictive PAC learnability from distribution-free PAC learnability. The proof bypasses a uniform Glivenko-Cantelli theorem for exchangeable variables.

Theorem 4.2: Let $\mathscr{F}$ be a non-trivial universally separable function class on a standard Borel space $\Omega$ which is uniform Glivenko-Cantelli (in the classical sense), with the sample complexity $n = s(\delta,\epsilon)$. Then $\mathscr{F}$ is predictive PAC learnable with the sample complexity $s(\delta\epsilon, \epsilon/2)$ under the family of all sequences of $\Omega$-valued exchangeable random variables.

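Proof: For every $n$, let $\epsilon_n$ be the smallest $\epsilon > 0$ with the property $s(0.5, \epsilon) \leq n$. Since $\mathscr{F}$ is non-trivial, that is, contains at least two functions, $\epsilon_n > 0$. Let $\mathscr{F}'$ be a countable dense subfamily of $\mathscr{F}$ such that every $f \in \mathscr{F}$ is a pointwise limit of a sequence of functions from $\mathscr{F}'$. For every $\sigma$, the set of samples of the form $f \upharpoonright \sigma$, $f \in \mathscr{F}'$, is clearly dense in the set of samples $f \upharpoonright \sigma$, $f \in \mathscr{F}$. For this reason, using standard selection theorems (e.g. Theorem 5.3.2 in [?]), one can construct a measurable empirical risk minimization learning rule $\mathcal{L}$ on the set of samples

$$S_n(\mathscr{F}) = \{(f \upharpoonright \sigma) : \sigma \in \Omega^n,\ f \in \mathscr{F}\},$$

taking values in the countable family $\mathscr{F}'$ and such that for every $n$ and each $(\sigma, s) \in S_n(\mathscr{F})$

$$\frac{1}{n} S_n(|\mathcal{L}(s) \upharpoonright \sigma - s|) < \epsilon_n.$$

Notice that for every $n > s(\delta,\epsilon)$, whenever $\delta < 0.5$, one has $\epsilon_n \leq \epsilon$ and so $\epsilon + \epsilon_n \leq 2\epsilon$. For this reason, and taking into account the uniform Glivenko-Cantelli property of $\mathscr{F}$, for every $\theta \in P(\Omega)$ and each $f \in \mathscr{F}$ one has

$$P\{E_\theta(|\mathcal{L}(f \upharpoonright \sigma) - f|) > 2\epsilon\} < \delta. \tag{5}$$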



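Now let $f \in \mathscr{F}$ and $\epsilon, \delta > 0$. According to Eq. (3), for a.e. $\theta \in P(\Omega)$ there is a subset $W = W_\theta \subseteq \Omega^\infty$ with $\theta^\infty(W) = 1$ and such that for every $\omega \in W$ and each $g \in \{f\} \cup \mathscr{F}'$,

$$E(g \,\|\, \mathscr{T})(\omega) = E_\theta(g).$$

Let $\sigma_n(\omega)$ denote, for short, the sequence of values $X_1(\omega), X_2(\omega), \ldots, X_n(\omega)$. Define

$$A = \{\omega : E(|\mathcal{L}(f \upharpoonright \sigma_n(\omega))(X_1) - f(X_1)| \,\|\, \mathscr{T})(\omega) < 2\epsilon\}. \tag{6}$$

For a.e. $\theta$, one has, $\theta^\infty$-a.s.,

$$A \cap W_\theta = \{\omega : E_\theta(|\mathcal{L}(f \upharpoonright \sigma_n(\omega)) - f|) < 2\epsilon\}. \tag{7}$$

According to (5), once $n > s(\delta,\epsilon)$,

$$\theta^\infty(A \cap W_\theta) > 1 - \delta,$$

and consequently

$$P(A) = \int \theta^\infty(A) \, \eta(d\theta) > 1 - \delta.$$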



Because of symmetry, we can replace $X_1$ in the definition (6) of $A$ with $X_{n+1}$.

Now we apply Lemma 3.1 to the countable family of functions $\mathscr{G} = \{f\} \cup \{\mathcal{L}(f \upharpoonright \sigma) : \sigma \in \Omega^n\}$. Conditioning on $X_1, X_2, \ldots, X_n$ amounts to integrating with respect to the conditional distribution $P(d\omega \,\|\, X_1, X_2, \ldots, X_n)$. One must have

$$P\{\omega : P(A^c \,\|\, X_1, X_2, \ldots, X_n)(\omega) > 2\epsilon\} < \delta\epsilon^{-1}.$$
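
One way to justify this last inequality, offered here as a reading of the step rather than as the argument of the original text, is to combine the bound $P(A) > 1 - \delta$ obtained above with Markov's inequality applied to the non-negative random variable $P(A^c \,\|\, X_1, \ldots, X_n)$:

```latex
% Sketch: the tower property gives the mean of the conditional probability,
% and Markov's inequality at level 2\epsilon gives the tail bound.
\begin{align*}
E\bigl[ P(A^c \,\|\, X_1, \ldots, X_n) \bigr] &= P(A^c) < \delta, \\
P\bigl\{ \omega : P(A^c \,\|\, X_1, \ldots, X_n)(\omega) > 2\epsilon \bigr\}
  &\leq \frac{P(A^c)}{2\epsilon} < \frac{\delta}{2\epsilon} < \delta\epsilon^{-1}.
\end{align*}
```

This is the place where the confidence parameter picks up the factor of $\epsilon$ that appears in the sample complexity $s(\delta\epsilon, \epsilon/2)$ of Theorem 4.2.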
We conclude: 



$$P\{\sigma \in \Omega^n : E(|\mathcal{L}(\sigma, f \upharpoonright \sigma) - f|(X_{n+1}) \,\|\, X_1, X_2, \ldots, X_n) < 2\epsilon\} \ \ldots$$



Remark 4.3: The proof can be modified so that $\epsilon/2$ is replaced with $\epsilon - \gamma_n$ for an arbitrary sequence $\gamma_n \downarrow 0$. We have only chosen $\epsilon/2$ for simplicity. On the other hand, the extra factor of $\epsilon$ added to $\delta$ does not make much difference, because, unlike the learning precision $\epsilon$, the confidence parameter $\delta$ is well known to be "cheap".

Corollary 4.4: Let $\mathscr{C}$ be a universally separable concept class on a standard Borel space $\Omega$ having finite VC-dimension $d$. Then $\mathscr{C}$ admits a learning rule which is predictive PAC with regard to any sequence of exchangeable data inputs, with the sample complexity bound

$$s(\delta,\epsilon) = \max\left\{ \frac{16d}{\epsilon} \lg \frac{16e}{\epsilon},\ \frac{8}{\epsilon} \lg \frac{2}{\delta} + \frac{8}{\epsilon} \lg \frac{1}{\epsilon} \right\}.$$

The proof follows from Theorem 4.2 and the sample complexity bound for distribution-free PAC learnability ([16], Theorem 7.8),

$$s(\delta,\epsilon) = \max\left\{ \frac{8d}{\epsilon} \lg \frac{8e}{\epsilon},\ \frac{4}{\epsilon} \lg \frac{2}{\delta} \right\}.$$
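
The passage from the i.i.d. bound to the predictive one is the substitution $\delta \mapsto \delta\epsilon$, $\epsilon \mapsto \epsilon/2$ of Theorem 4.2, which can be checked numerically. The sketch below is illustrative only: it assumes that $\lg$ denotes the base-2 logarithm and that $e$ is Euler's number, and the function names and the sample parameter values are hypothetical.

```python
import math

def iid_sample_complexity(d, delta, eps):
    """max{(8d/eps) lg(8e/eps), (4/eps) lg(2/delta)}; lg assumed to be log base 2."""
    return max((8 * d / eps) * math.log2(8 * math.e / eps),
               (4 / eps) * math.log2(2 / delta))

def predictive_sample_complexity(d, delta, eps):
    """Theorem 4.2 substitution: evaluate the i.i.d. bound at (delta * eps, eps / 2)."""
    return iid_sample_complexity(d, delta * eps, eps / 2)

def corollary_bound(d, delta, eps):
    """Closed form: max{(16d/eps) lg(16e/eps), (8/eps) lg(2/delta) + (8/eps) lg(1/eps)}."""
    return max((16 * d / eps) * math.log2(16 * math.e / eps),
               (8 / eps) * math.log2(2 / delta) + (8 / eps) * math.log2(1 / eps))

d, delta, eps = 3, 0.05, 0.1  # hypothetical VC-dimension, confidence, precision
print(iid_sample_complexity(d, delta, eps))
print(predictive_sample_complexity(d, delta, eps))
print(corollary_bound(d, delta, eps))  # agrees with the substituted bound
```

The two ways of computing the predictive bound agree because $\lg(2/(\delta\epsilon)) = \lg(2/\delta) + \lg(1/\epsilon)$, and the predictive bound is larger than the i.i.d. one by a little more than a factor of two in the VC-dimension term.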



V. Conclusion 

Predictive PAC learnability of a function class $\mathscr{F}$ allows one to bound, with high confidence, the probability of misclassification of a value of a classifier function $f \in \mathscr{F}$ at any future data sample $X_i(\omega)$, $i > n$, given the values of $f$ on a multisample $X_1(\omega), X_2(\omega), \ldots, X_n(\omega)$. Under this version of learnability, the function $f \in \mathscr{F}$ cannot be learned in general; it is only its future values that can be predicted with high confidence. For a large number of problems of statistical learning, this is apparently sufficient.

In statistics, exchangeable random variables and de Finetti's theorem are at the forefront of an ongoing discussion between frequentists and Bayesians. (Cf. [3], p. 475.) There is however no need to enter the fray and choose sides, simply because, in Vapnik's words [14], p. 720,

"Statistical learning theory does not belong to any specific branch of science: It has its own goals, its own paradigm, and its own techniques. Statisticians (who have their own paradigm) never considered this theory as part of statistics".

Thus, our new approach can be seen just as an addition to the classical framework of learning theory, possessing its own inner dynamics and putting forward a number of open questions.

Among the most immediate, let us mention the following three, all concerning Theorem 4.2. Can one maintain the initial sample complexity $s(\delta,\epsilon)$ in the conclusion of the result? Does the theorem hold under less restrictive measurability assumptions on $\mathscr{F}$ than universal separability, for instance, under the assumption that $\mathscr{F}$ is image admissible Souslin ([5], pages 186-187)? Can one conclude that $\mathscr{F}$ is consistently predictive PAC learnable, that is, predictive PAC learnable under every consistent learning rule $\mathcal{L}$?